Problem Statement¶

Business Context¶

Renewable energy sources play an increasingly important role in the global energy mix, as the effort to reduce the environmental impact of energy production increases.

Out of all the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.

Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable and if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.

The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, break, etc.).

Objective¶

“ReneWind” is a company working on improving the machinery/processes involved in the production of wind energy using machine learning and has collected data of generator failure of wind turbines using sensors. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies with companies). Data has 40 predictors, 20000 observations in the training set and 5000 in the test set.

The objective is to build various classification models, tune them, and find the best one that will help identify failures so that the generators could be repaired before failing/breaking to reduce the overall maintenance cost. The nature of predictions made by the classification model will translate as follows:

  • True positives (TP) are failures correctly predicted by the model. These will result in repairing costs.
  • False negatives (FN) are real failures where there is no detection by the model. These will result in replacement costs.
  • False positives (FP) are detections where there is no failure. These will result in inspection costs.

It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.

“1” in the target variables should be considered as “failure” and “0” represents “No failure”.

Data Description¶

  • The data provided is a transformed version of original data which was collected using sensors.
  • Train.csv - To be used for training and tuning of models.
  • Test.csv - To be used only for testing the performance of the final best model.
  • Both the datasets consist of 40 predictor variables and 1 target variable

Importing necessary libraries¶

In [136]:
# To help with reading and manipulating data
import pandas as pd
import numpy as np

# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# To be used for missing value imputation
from sklearn.impute import SimpleImputer

# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier

# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    ConfusionMatrixDisplay,
)

# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)

# To supress scientific notations for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To supress warnings
import warnings

warnings.filterwarnings("ignore")

# This will help in making the Python code more structured automatically (good coding practice)
#%load_ext nb_black

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

Loading the dataset¶

In [137]:
df_train=pd.read_csv("Train.csv")
df_test=pd.read_csv("Test.csv")

Data Overview¶

  • Observations
  • Sanity checks
In [138]:
df_train.head()
Out[138]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
0 -4.465 -4.679 3.102 0.506 -0.221 -2.033 -2.911 0.051 -1.522 3.762 -5.715 0.736 0.981 1.418 -3.376 -3.047 0.306 2.914 2.270 4.395 -2.388 0.646 -1.191 3.133 0.665 -2.511 -0.037 0.726 -3.982 -1.073 1.667 3.060 -1.690 2.846 2.235 6.667 0.444 -2.369 2.951 -3.480 0
1 3.366 3.653 0.910 -1.368 0.332 2.359 0.733 -4.332 0.566 -0.101 1.914 -0.951 -1.255 -2.707 0.193 -4.769 -2.205 0.908 0.757 -5.834 -3.065 1.597 -1.757 1.766 -0.267 3.625 1.500 -0.586 0.783 -0.201 0.025 -1.795 3.033 -2.468 1.895 -2.298 -1.731 5.909 -0.386 0.616 0
2 -3.832 -5.824 0.634 -2.419 -1.774 1.017 -2.099 -3.173 -2.082 5.393 -0.771 1.107 1.144 0.943 -3.164 -4.248 -4.039 3.689 3.311 1.059 -2.143 1.650 -1.661 1.680 -0.451 -4.551 3.739 1.134 -2.034 0.841 -1.600 -0.257 0.804 4.086 2.292 5.361 0.352 2.940 3.839 -4.309 0
3 1.618 1.888 7.046 -1.147 0.083 -1.530 0.207 -2.494 0.345 2.119 -3.053 0.460 2.705 -0.636 -0.454 -3.174 -3.404 -1.282 1.582 -1.952 -3.517 -1.206 -5.628 -1.818 2.124 5.295 4.748 -2.309 -3.963 -6.029 4.949 -3.584 -2.577 1.364 0.623 5.550 -1.527 0.139 3.101 -1.277 0
4 -0.111 3.872 -3.758 -2.983 3.793 0.545 0.205 4.849 -1.855 -6.220 1.998 4.724 0.709 -1.989 -2.633 4.184 2.245 3.734 -6.313 -5.380 -0.887 2.062 9.446 4.490 -3.945 4.582 -8.780 -3.383 5.107 6.788 2.044 8.266 6.629 -10.069 1.223 -3.230 1.687 -2.164 -3.645 6.510 0
In [139]:
df_test.head()
Out[139]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
0 -0.613 -3.820 2.202 1.300 -1.185 -4.496 -1.836 4.723 1.206 -0.342 -5.123 1.017 4.819 3.269 -2.984 1.387 2.032 -0.512 -1.023 7.339 -2.242 0.155 2.054 -2.772 1.851 -1.789 -0.277 -1.255 -3.833 -1.505 1.587 2.291 -5.411 0.870 0.574 4.157 1.428 -10.511 0.455 -1.448 0
1 0.390 -0.512 0.527 -2.577 -1.017 2.235 -0.441 -4.406 -0.333 1.967 1.797 0.410 0.638 -1.390 -1.883 -5.018 -3.827 2.418 1.762 -3.242 -3.193 1.857 -1.708 0.633 -0.588 0.084 3.014 -0.182 0.224 0.865 -1.782 -2.475 2.494 0.315 2.059 0.684 -0.485 5.128 1.721 -1.488 0
2 -0.875 -0.641 4.084 -1.590 0.526 -1.958 -0.695 1.347 -1.732 0.466 -4.928 3.565 -0.449 -0.656 -0.167 -1.630 2.292 2.396 0.601 1.794 -2.120 0.482 -0.841 1.790 1.874 0.364 -0.169 -0.484 -2.119 -2.157 2.907 -1.319 -2.997 0.460 0.620 5.632 1.324 -1.752 1.808 1.676 0
3 0.238 1.459 4.015 2.534 1.197 -3.117 -0.924 0.269 1.322 0.702 -5.578 -0.851 2.591 0.767 -2.391 -2.342 0.572 -0.934 0.509 1.211 -3.260 0.105 -0.659 1.498 1.100 4.143 -0.248 -1.137 -5.356 -4.546 3.809 3.518 -3.074 -0.284 0.955 3.029 -1.367 -3.412 0.906 -2.451 0
4 5.828 2.768 -1.235 2.809 -1.642 -1.407 0.569 0.965 1.918 -2.775 -0.530 1.375 -0.651 -1.679 -0.379 -4.443 3.894 -0.608 2.945 0.367 -5.789 4.598 4.450 3.225 0.397 0.248 -2.362 1.079 -0.473 2.243 -3.591 1.774 -1.502 -2.227 4.777 -6.560 -0.806 -0.276 -3.858 -0.538 0
In [140]:
df_train.shape
Out[140]:
(20000, 41)

There are 20,000 rows and 41 attributes (including the predictor) in the training dataset

In [141]:
df_test.shape
Out[141]:
(5000, 41)

There are 5000 rows and 41 attributes (including the predictor) in the test dataset

In [142]:
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 41 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V1      19982 non-null  float64
 1   V2      19982 non-null  float64
 2   V3      20000 non-null  float64
 3   V4      20000 non-null  float64
 4   V5      20000 non-null  float64
 5   V6      20000 non-null  float64
 6   V7      20000 non-null  float64
 7   V8      20000 non-null  float64
 8   V9      20000 non-null  float64
 9   V10     20000 non-null  float64
 10  V11     20000 non-null  float64
 11  V12     20000 non-null  float64
 12  V13     20000 non-null  float64
 13  V14     20000 non-null  float64
 14  V15     20000 non-null  float64
 15  V16     20000 non-null  float64
 16  V17     20000 non-null  float64
 17  V18     20000 non-null  float64
 18  V19     20000 non-null  float64
 19  V20     20000 non-null  float64
 20  V21     20000 non-null  float64
 21  V22     20000 non-null  float64
 22  V23     20000 non-null  float64
 23  V24     20000 non-null  float64
 24  V25     20000 non-null  float64
 25  V26     20000 non-null  float64
 26  V27     20000 non-null  float64
 27  V28     20000 non-null  float64
 28  V29     20000 non-null  float64
 29  V30     20000 non-null  float64
 30  V31     20000 non-null  float64
 31  V32     20000 non-null  float64
 32  V33     20000 non-null  float64
 33  V34     20000 non-null  float64
 34  V35     20000 non-null  float64
 35  V36     20000 non-null  float64
 36  V37     20000 non-null  float64
 37  V38     20000 non-null  float64
 38  V39     20000 non-null  float64
 39  V40     20000 non-null  float64
 40  Target  20000 non-null  int64  
dtypes: float64(40), int64(1)
memory usage: 6.3 MB

All variables except target are float type

In [143]:
df_train.duplicated().sum()
Out[143]:
0
In [144]:
df_train.isnull().sum()
Out[144]:
V1        18
V2        18
V3         0
V4         0
V5         0
V6         0
V7         0
V8         0
V9         0
V10        0
V11        0
V12        0
V13        0
V14        0
V15        0
V16        0
V17        0
V18        0
V19        0
V20        0
V21        0
V22        0
V23        0
V24        0
V25        0
V26        0
V27        0
V28        0
V29        0
V30        0
V31        0
V32        0
V33        0
V34        0
V35        0
V36        0
V37        0
V38        0
V39        0
V40        0
Target     0
dtype: int64

There are 18 missing values for attribute "V1" and 18 missing values for attribute "V2"

In [145]:
df_train.describe().T
Out[145]:
count mean std min 25% 50% 75% max
V1 19982.000 -0.272 3.442 -11.876 -2.737 -0.748 1.840 15.493
V2 19982.000 0.440 3.151 -12.320 -1.641 0.472 2.544 13.089
V3 20000.000 2.485 3.389 -10.708 0.207 2.256 4.566 17.091
V4 20000.000 -0.083 3.432 -15.082 -2.348 -0.135 2.131 13.236
V5 20000.000 -0.054 2.105 -8.603 -1.536 -0.102 1.340 8.134
V6 20000.000 -0.995 2.041 -10.227 -2.347 -1.001 0.380 6.976
V7 20000.000 -0.879 1.762 -7.950 -2.031 -0.917 0.224 8.006
V8 20000.000 -0.548 3.296 -15.658 -2.643 -0.389 1.723 11.679
V9 20000.000 -0.017 2.161 -8.596 -1.495 -0.068 1.409 8.138
V10 20000.000 -0.013 2.193 -9.854 -1.411 0.101 1.477 8.108
V11 20000.000 -1.895 3.124 -14.832 -3.922 -1.921 0.119 11.826
V12 20000.000 1.605 2.930 -12.948 -0.397 1.508 3.571 15.081
V13 20000.000 1.580 2.875 -13.228 -0.224 1.637 3.460 15.420
V14 20000.000 -0.951 1.790 -7.739 -2.171 -0.957 0.271 5.671
V15 20000.000 -2.415 3.355 -16.417 -4.415 -2.383 -0.359 12.246
V16 20000.000 -2.925 4.222 -20.374 -5.634 -2.683 -0.095 13.583
V17 20000.000 -0.134 3.345 -14.091 -2.216 -0.015 2.069 16.756
V18 20000.000 1.189 2.592 -11.644 -0.404 0.883 2.572 13.180
V19 20000.000 1.182 3.397 -13.492 -1.050 1.279 3.493 13.238
V20 20000.000 0.024 3.669 -13.923 -2.433 0.033 2.512 16.052
V21 20000.000 -3.611 3.568 -17.956 -5.930 -3.533 -1.266 13.840
V22 20000.000 0.952 1.652 -10.122 -0.118 0.975 2.026 7.410
V23 20000.000 -0.366 4.032 -14.866 -3.099 -0.262 2.452 14.459
V24 20000.000 1.134 3.912 -16.387 -1.468 0.969 3.546 17.163
V25 20000.000 -0.002 2.017 -8.228 -1.365 0.025 1.397 8.223
V26 20000.000 1.874 3.435 -11.834 -0.338 1.951 4.130 16.836
V27 20000.000 -0.612 4.369 -14.905 -3.652 -0.885 2.189 17.560
V28 20000.000 -0.883 1.918 -9.269 -2.171 -0.891 0.376 6.528
V29 20000.000 -0.986 2.684 -12.579 -2.787 -1.176 0.630 10.722
V30 20000.000 -0.016 3.005 -14.796 -1.867 0.184 2.036 12.506
V31 20000.000 0.487 3.461 -13.723 -1.818 0.490 2.731 17.255
V32 20000.000 0.304 5.500 -19.877 -3.420 0.052 3.762 23.633
V33 20000.000 0.050 3.575 -16.898 -2.243 -0.066 2.255 16.692
V34 20000.000 -0.463 3.184 -17.985 -2.137 -0.255 1.437 14.358
V35 20000.000 2.230 2.937 -15.350 0.336 2.099 4.064 15.291
V36 20000.000 1.515 3.801 -14.833 -0.944 1.567 3.984 19.330
V37 20000.000 0.011 1.788 -5.478 -1.256 -0.128 1.176 7.467
V38 20000.000 -0.344 3.948 -17.375 -2.988 -0.317 2.279 15.290
V39 20000.000 0.891 1.753 -6.439 -0.272 0.919 2.058 7.760
V40 20000.000 -0.876 3.012 -11.024 -2.940 -0.921 1.120 10.654
Target 20000.000 0.056 0.229 0.000 0.000 0.000 0.000 1.000
In [146]:
df_train["Target"].sum()
Out[146]:
1110

Out of 20000 values 1110 values of our target are '1's. We would generalize to aim to oversample the data by looking at this number. Lets look at cross validation results from training and validation and chose to go for oversample or undersample later in this model

Exploratory Data Analysis (EDA)¶

Plotting histograms and boxplots for all the variables¶

In [147]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to the show density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

Plotting all the features at one go¶

In [148]:
for feature in df_train.columns:
    histogram_boxplot(df_train, feature, figsize=(12, 7), kde=False, bins=None) ## Please change the dataframe name as you define while reading the data
In [149]:
for feature in df_test.columns:
    histogram_boxplot(df_test, feature, figsize=(12, 7), kde=False, bins=None) ## Please change the dataframe name as you define while reading the data
In [150]:
plt.figure(figsize=(30, 20))
sns.heatmap(df_train.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()

V2 - V26 V7 - V15 V8 - V16 V11 - V29 V16 - V21 V19 - V34 All these combinations are showing positive correlation

In [151]:
plt.figure(figsize=(30, 20))
sns.heatmap(df_test.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()

Data Pre-processing¶

In [152]:
X = df_train.drop(["Target"], axis=1)
y=df_train["Target"]
In [153]:
# Splitting data into training, validation and test sets:
# first we split data into 2 parts, say temporary and test

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# then we split the temporary set into train and validation

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(12000, 40) (4000, 40) (4000, 40)

Missing value imputation¶

In [154]:
# Let's impute the missing values
imp_mode = SimpleImputer(missing_values=np.nan, strategy="median")
cols_to_impute = ["V1", "V2"]

# fit and transform the imputer on train data
X_train[cols_to_impute] = imp_mode.fit_transform(X_train[cols_to_impute])

# Transform on validation and test data
X_val[cols_to_impute] = imp_mode.transform(X_val[cols_to_impute])

# fit and transform the imputer on test data
X_test[cols_to_impute] = imp_mode.transform(X_test[cols_to_impute])

We will use median to impute missing values in "V1" and "V2" columns.

In [155]:
# Checking that no column has missing values in train or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
V1     0
V2     0
V3     0
V4     0
V5     0
V6     0
V7     0
V8     0
V9     0
V10    0
V11    0
V12    0
V13    0
V14    0
V15    0
V16    0
V17    0
V18    0
V19    0
V20    0
V21    0
V22    0
V23    0
V24    0
V25    0
V26    0
V27    0
V28    0
V29    0
V30    0
V31    0
V32    0
V33    0
V34    0
V35    0
V36    0
V37    0
V38    0
V39    0
V40    0
dtype: int64
------------------------------
V1     0
V2     0
V3     0
V4     0
V5     0
V6     0
V7     0
V8     0
V9     0
V10    0
V11    0
V12    0
V13    0
V14    0
V15    0
V16    0
V17    0
V18    0
V19    0
V20    0
V21    0
V22    0
V23    0
V24    0
V25    0
V26    0
V27    0
V28    0
V29    0
V30    0
V31    0
V32    0
V33    0
V34    0
V35    0
V36    0
V37    0
V38    0
V39    0
V40    0
dtype: int64
------------------------------
V1     0
V2     0
V3     0
V4     0
V5     0
V6     0
V7     0
V8     0
V9     0
V10    0
V11    0
V12    0
V13    0
V14    0
V15    0
V16    0
V17    0
V18    0
V19    0
V20    0
V21    0
V22    0
V23    0
V24    0
V25    0
V26    0
V27    0
V28    0
V29    0
V30    0
V31    0
V32    0
V33    0
V34    0
V35    0
V36    0
V37    0
V38    0
V39    0
V40    0
dtype: int64

All missing values have been treated.

In [156]:
# Creating dummy variables for categorical variables
X_train = pd.get_dummies(data=X_train, drop_first=True)
X_val = pd.get_dummies(data=X_val, drop_first=True)
X_test = pd.get_dummies(data=X_test, drop_first=True)

Model Building¶

Model evaluation criterion¶

The nature of predictions made by the classification model will translate as follows:

  • True positives (TP) are failures correctly predicted by the model.
  • False negatives (FN) are real failures in a generator where there is no detection by model.
  • False positives (FP) are failure detections in a generator where there is no failure.

Which metric to optimize?

  • We need to choose the metric which will ensure that the maximum number of generator failures are predicted correctly by the model.
  • We would want Recall to be maximized as greater the Recall, the higher the chances of minimizing false negatives.
  • We want to minimize false negatives because if a model predicts that a machine will have no failure when there will be a failure, it will increase the maintenance cost.

Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.

In [157]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1
            
        },
        index=[0],
    )

    return df_perf
In [158]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [159]:
scorer = metrics.make_scorer(metrics.recall_score)
In [160]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))

results = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models
score = []
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean() * 100))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    score.append(scores)
    print("{}: {}".format(name, scores))
Cross-Validation Performance:

Bagging: 69.51520592526091
Random forest: 71.02457636628885
GBM: 70.12007631017842
Adaboost: 62.46661429693636
Xgboost: 79.27729772191674
dtree: 73.27236000448882

Validation Performance:

Bagging: 0.6891891891891891
Random forest: 0.7072072072072072
GBM: 0.7387387387387387
Adaboost: 0.6351351351351351
Xgboost: 0.7972972972972973
dtree: 0.7207207207207207
In [161]:
print("\n" "Training Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_train, model.predict(X_train))
    print("{}: {}".format(name, scores))
Training Performance:

Bagging: 0.9459459459459459
Random forest: 0.9984984984984985
GBM: 0.8348348348348348
Adaboost: 0.6261261261261262
Xgboost: 1.0
dtree: 1.0

The cross validation training performance scores are similar to the validation perfromance score. There is a tendency for some models (decision tree, random forest, bagging and XGBoost) to overfit the training set; as the training performance score approaches 1.

In [162]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure()

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results)
ax.set_xticklabels(names)

plt.show()
  • We can see that the xgboost is giving the highest cross-validated recall followed by decision tree
  • The boxplot shows that the performance of decision tree and xgboost is consistent and their performance on the validation set is also good
  • We will tune the best two models i.e. decision tree and xgboost and see if the performance improves
In [163]:
# Use function DecisionTreeClassifier from sklearn to build model - consider `gini` criterion to split data at nodes
dtree = DecisionTreeClassifier(criterion="gini", random_state=1)
dtree.fit(X_train, y_train)
Out[163]:
DecisionTreeClassifier(random_state=1)
In [164]:
# User-defined function to plot the confusion_matrix of a classification model built using sklearn based on test set
def make_confusion_matrix(model):
    """
    model: classifier to predict values of Y
    """
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.title("Test Set's Confusion Matrix", fontsize=16)
    plt.ylabel("Actual Label", fontsize=15)
    plt.xlabel("Predicted Label", fontsize=15)
In [165]:
# Create confusion matrix based on test data set
make_confusion_matrix(dtree)

# Check performance of model on both training and test data sets
perf_dcsn_tree = model_performance_classification_sklearn(dtree, X_train, y_train)
perf_dcsn_tree
Out[165]:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
In [166]:
# Create confusion matrix based on test data set
make_confusion_matrix(dtree)

# Check performance of model on both training and test data sets
perf_dcsn_tree_val = model_performance_classification_sklearn(dtree, X_val, y_val)
perf_dcsn_tree_val
Out[166]:
Accuracy Recall Precision F1
0 0.969 0.721 0.721 0.721

Model Building with original data¶

Sample Decision Tree model building with original data

In [167]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("dtree", DecisionTreeClassifier(random_state=1)))

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

dtree: 0.7327236000448882

Validation Performance:

dtree: 0.7207207207207207

Model Building with Oversampled data¶

In [168]:
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
In [169]:
print("Before UpSampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before UpSampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

sm = SMOTE(
    sampling_strategy=1, k_neighbors=5, random_state=1
)  # Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)


print("After UpSampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After UpSampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))


print("After UpSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After UpSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before UpSampling, counts of label 'Yes': 666
Before UpSampling, counts of label 'No': 11334 

After UpSampling, counts of label 'Yes': 11334
After UpSampling, counts of label 'No': 11334 

After UpSampling, the shape of train_X: (22668, 40)
After UpSampling, the shape of train_y: (22668,) 

In [170]:
# Use function DecisionTreeClassifier from sklearn to build model - consider `gini` criterion to split data at nodes
dtree = DecisionTreeClassifier(criterion="gini", random_state=1)
dtree.fit(X_train_over, y_train_over)
Out[170]:
DecisionTreeClassifier(random_state=1)
In [171]:
# Create confusion matrix based on test data set
make_confusion_matrix(dtree)

# Check performance of model on both training and test data sets
perf_dcsn_tree_over = model_performance_classification_sklearn(dtree, X_train_over, y_train_over)
perf_dcsn_tree_over
Out[171]:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
In [172]:
models_over = []  # Empty list to store all the models

# Appending models into the list
models_over.append(("Bagging", BaggingClassifier(random_state=1)))
models_over.append(("Random forest", RandomForestClassifier(random_state=1)))
models_over.append(("GBM", GradientBoostingClassifier(random_state=1)))
models_over.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models_over.append(("Xgboost", DecisionTreeClassifier(random_state=1)))
models_over.append(("dtree", XGBClassifier(random_state=1, eval_metric="logloss")))
results_over = []  # Empty list to store all model's CV scores
names_over = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score

print("\n" "Cross-Validation Performance:" "\n")
for name, model in models_over:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
    )
    results_over.append(cv_result)
    names_over.append(name)
    print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    score.append(scores)
    print("{}: {}".format(name, scores))    
    
Cross-Validation Performance:

Bagging: 0.9762662881334749
Random forest: 0.9848244761264405
GBM: 0.9241221470338262
Adaboost: 0.8959765404936946
Xgboost: 0.9693842074260146
dtree: 0.9910888837929835

Validation Performance:

dtree: 0.7207207207207207
In [173]:
print("\n" "Validation Performance:" "\n")

for name, model in models_over:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Validation Performance:

Bagging: 0.6891891891891891
Random forest: 0.7072072072072072
GBM: 0.7387387387387387
Adaboost: 0.6351351351351351
Xgboost: 0.7207207207207207
dtree: 0.7972972972972973
In [174]:
print("\n" "Training Performance:" "\n")

for name, model in models_over:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_train_over, model.predict(X_train_over))
    print("{}: {}".format(name, scores))
Training Performance:

Bagging: 0.9987647785424387
Random forest: 1.0
GBM: 0.9299452973354508
Adaboost: 0.9021528145403211
Xgboost: 1.0
dtree: 1.0

The cross validation training performance scores are much higher than validation perfromance score. This indicates that the default algorithms on oversampled dataset are not able to generalize well

In [175]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure()

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results_over)
ax.set_xticklabels(names_over)

plt.show()

The average (& median) training cross validation scores on oversampled dataset has increased to match training performance scores across decision tree algorithms. This indicates potential overfitting of noise in the training datasets

Model Building with Undersampled data¶

In [176]:
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
In [177]:
# Create confusion matrix based on test data set
make_confusion_matrix(dtree)

# Check performance of model on both training and test data sets
perf_dcsn_tree_und = model_performance_classification_sklearn(dtree, X_train_un, y_train_un)
perf_dcsn_tree_und
Out[177]:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
In [178]:
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))

print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, counts of label 'Yes': 666
Before Under Sampling, counts of label 'No': 11334 

After Under Sampling, counts of label 'Yes': 666
After Under Sampling, counts of label 'No': 666 

After Under Sampling, the shape of train_X: (1332, 40)
After Under Sampling, the shape of train_y: (1332,) 

In [179]:
models_un = []  # Empty list to store all the models

# Appending models into the list
models_un.append(("Bagging", BaggingClassifier(random_state=1)))
models_un.append(("Random forest", RandomForestClassifier(random_state=1)))
models_un.append(("GBM", GradientBoostingClassifier(random_state=1)))
models_un.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models_un.append(("dtree", DecisionTreeClassifier(random_state=1)))
models_un.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
results_un = []  # Empty list to store all model's CV scores
names_un = []  # Empty list to store name of the models

# loop through all models to get the mean cross validated score

print("\n" "Cross-Validation Performance:" "\n")

for name, model in models_un:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un, scoring=scorer, cv=kfold
    )
    results_un.append(cv_result)
    names_un.append(name)
    print("{}: {}".format(name, cv_result.mean()))
Cross-Validation Performance:

Bagging: 0.8559084277858826
Random forest: 0.8934238581528448
GBM: 0.8904500056110425
Adaboost: 0.8694422623723487
dtree: 0.8303445180114466
Xgboost: 0.8829087644484345
In [180]:
print("\n" "Validation Performance:" "\n")

for name, model in models_un:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Validation Performance:

Bagging: 0.8963963963963963
Random forest: 0.9099099099099099
GBM: 0.9009009009009009
Adaboost: 0.8873873873873874
dtree: 0.8873873873873874
Xgboost: 0.8963963963963963
In [181]:
print("\n" "Training Performance:" "\n")

for name, model in models_un:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_train_un, model.predict(X_train_un))
    print("{}: {}".format(name, scores))
Training Performance:

Bagging: 0.978978978978979
Random forest: 1.0
GBM: 0.9429429429429429
Adaboost: 0.9054054054054054
dtree: 1.0
Xgboost: 1.0

The performance score have dropped on the validation undersampled dataset than original dataset.

In [182]:
# Plotting boxplots for CV scores of all models defined above

fig = plt.figure(figsize=(10, 4))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results_un)
ax.set_xticklabels(names_un)
plt.show()

The algorithms are able to give better performance on the cross validation training scores on undersampled dataset in comparison to original dataset as can be seen from the boxplots.

HyperparameterTuning¶

Sample Parameter Grids¶

Hyperparameter tuning can take a long time to run, so to avoid that time complexity - you can use the following grids, wherever required.

  • For Gradient Boosting:

param_grid = { "n_estimators": np.arange(100,150,25), "learning_rate": [0.2, 0.05, 1], "subsample":[0.5,0.7], "max_features":[0.5,0.7] }

  • For Adaboost:

param_grid = { "n_estimators": [100, 150, 200], "learning_rate": [0.2, 0.05], "base_estimator": [DecisionTreeClassifier(max_depth=1, random_state=1), DecisionTreeClassifier(max_depth=2, random_state=1), DecisionTreeClassifier(max_depth=3, random_state=1), ] }

  • For Bagging Classifier:

param_grid = { 'max_samples': [0.8,0.9,1], 'max_features': [0.7,0.8,0.9], 'n_estimators' : [30,50,70], }

  • For Random Forest:

param_grid = { "n_estimators": [200,250,300], "min_samples_leaf": np.arange(1, 4), "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'], "max_samples": np.arange(0.4, 0.7, 0.1) }

  • For Decision Trees:

param_grid = { 'max_depth': np.arange(2,6), 'min_samples_leaf': [1, 4, 7], 'max_leaf_nodes' : [10, 15], 'min_impurity_decrease': [0.0001,0.001] }

  • For Logistic Regression:

param_grid = {'C': np.arange(0.1,1.1,0.1)}

  • For XGBoost:

param_grid={ 'n_estimators': [150, 200, 250], 'scale_pos_weight': [5,10], 'learning_rate': [0.1,0.2], 'gamma': [0,3,5], 'subsample': [0.8,0.9] }

Sample tuning method for Decision tree with original data¶

In [183]:
# defining model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,6),
              'min_samples_leaf': [1, 4, 7], 
              'max_leaf_nodes' : [10,15],
              'min_impurity_decrease': [0.0001,0.001] }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5} with CV score=0.5316462798788015:

Sample tuning method for Decision tree with oversampled data¶

In [184]:
# defining model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,6),
              'min_samples_leaf': [1, 4, 7], 
              'max_leaf_nodes' : [10,15],
              'min_impurity_decrease': [0.0001,0.001] }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 10, 'max_depth': 2} with CV score=0.9123889677716:

Sample tuning method for Decision tree with undersampled data¶

In [185]:
# defining model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,20),
              'min_samples_leaf': [1, 2, 5, 7], 
              'max_leaf_nodes' : [5, 10,15],
              'min_impurity_decrease': [0.0001,0.001] }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 1, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 15, 'max_depth': 11} with CV score=0.8363483335203681:

XGBoost Hyperparameter Tuning¶

In [186]:
xgb = XGBClassifier(random_state=1,eval_metric='logloss')
In [187]:
param_grid = {
    'n_estimators':[150,200,250],
    'scale_pos_weight':[5,10], 
    'learning_rate':[0.1,0.2],
    'gamma':[0,3,5],
    'subsample':[0.8,0.9]
    }
In [188]:
from sklearn.model_selection import RandomizedSearchCV 

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=xgb, param_distributions=param_grid, n_iter=5, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
In [189]:
randomized_cv.fit(X_train_over,y_train_over) ## Complete the code to fit the model on over sampled data

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'scale_pos_weight': 10, 'n_estimators': 200, 'learning_rate': 0.1, 'gamma': 5} with CV score=0.9956767169772682:
In [190]:
xgb2 = XGBClassifier(
    random_state=1,
    eval_metric="logloss",
    subsample=0.9,
    scale_pos_weight=10,
    n_estimators=200,
    learning_rate=0.1,
    gamma=5,
)## Complete the code with the best parameters obtained from tuning

# Fit the model on training data
xgb2.fit(X_train, y_train) 
Out[190]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=5, gpu_id=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.1, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, n_estimators=200, n_jobs=None,
              num_parallel_tree=None, predictor=None, random_state=1, ...)
In [191]:
xgboost_train_perf = model_performance_classification_sklearn(xgb2, X_train_over, y_train_over)
xgboost_train_perf
Out[191]:
Accuracy Recall Precision F1
0 0.965 0.931 0.999 0.964
In [192]:
xgboost_grid_val = model_performance_classification_sklearn(xgb2, X_val, y_val)
print("Validation performance:")
xgboost_grid_val
Validation performance:
Out[192]:
Accuracy Recall Precision F1
0 0.990 0.847 0.974 0.906

The best hyperparameters using RandomizedSearch CV for XGBoost model were found to be: subsample 0.9, scale_pos_weight 10, n_estimators 200, learning_rate 0.1 and gamma 5

In [193]:
confusion_matrix_sklearn(xgb2, X_val, y_val)

Random Forest Hyperparameter Tuning¶

In [194]:
model = RandomForestClassifier(random_state=1)
param_grid = {
    "n_estimators": [200, 250, 300],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": [np.arange(0.3, 0.6, 0.1), "sqrt"],
    "max_samples": np.arange(0.4, 0.7, 0.1),
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_grid,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
    n_jobs=-1,
)

# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)
print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)
Best parameters are {'n_estimators': 300, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.8948490629558972:
In [195]:
# building model with best parameters
rf_tuned1 = RandomForestClassifier(
    n_estimators=300,
    min_samples_leaf=1,
    max_samples=0.6,
    max_features="sqrt",
    random_state=1,
)

# Fit the model on training data
rf_tuned1.fit(X_train_un, y_train_un)
Out[195]:
RandomForestClassifier(max_features='sqrt', max_samples=0.6, n_estimators=300,
                       random_state=1)
In [196]:
# Calculating different metrics on training set
rf_random_train = model_performance_classification_sklearn(
    rf_tuned1, X_train_un, y_train_un
)
print("Training performance:")
rf_random_train
Training performance:
Out[196]:
Accuracy Recall Precision F1
0 0.993 0.986 1.000 0.993
In [197]:
# Calculating different metrics on validation set
rf_random_val = model_performance_classification_sklearn(rf_tuned1, X_val, y_val)
print("Validation performance:")
rf_random_val
Validation performance:
Out[197]:
Accuracy Recall Precision F1
0 0.947 0.905 0.513 0.655

The best hyperparameters using RandomizedSearch CV for Random forest model were found to be: max_features='sqrt', max_samples=0.6,n_estimators=300, random_state=1.

In [198]:
confusion_matrix_sklearn(rf_tuned1, X_val, y_val)

Bagging Classifier Hyperparameter Tuning¶

In [199]:
model2 = BaggingClassifier(random_state=1)

param_grid2 = {
    "max_samples": [0.8, 0.9, 1],
    "max_features": [0.7, 0.8, 0.9],
    "n_estimators": [30, 50, 70],
}


# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Calling RandomizedSearchCV
randomized_cv2 = RandomizedSearchCV(
    estimator=model2,
    param_distributions=param_grid2,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
    n_jobs=-1,
)

# Fitting parameters in RandomizedSearchCV
randomized_cv2.fit(X_train_un, y_train_un)
print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv2.best_params_, randomized_cv2.best_score_
    )
)
Best parameters are {'n_estimators': 70, 'max_samples': 0.9, 'max_features': 0.8} with CV score=0.8933901918976546:
In [200]:
# building model with best parameters
bagging_tuned = BaggingClassifier(
    n_estimators=70, max_samples=0.9, max_features=0.8, random_state=1,
)

# Fit the model on training data
bagging_tuned.fit(X_train_un, y_train_un)
Out[200]:
BaggingClassifier(max_features=0.8, max_samples=0.9, n_estimators=70,
                  random_state=1)
In [201]:
# Calculating different metrics on train set
bagging_random_train = model_performance_classification_sklearn(
    bagging_tuned, X_train_un, y_train_un
)
print("Training performance:")
bagging_random_train
Training performance:
Out[201]:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
In [202]:
# Calculating different metrics on validation set
bagging_random_val = model_performance_classification_sklearn(
    bagging_tuned, X_val, y_val
)
print("Validation performance:")
bagging_random_val
Validation performance:
Out[202]:
Accuracy Recall Precision F1
0 0.941 0.887 0.483 0.625
In [203]:
# creating confusion matrix
confusion_matrix_sklearn(bagging_tuned, X_val, y_val)

The best hyperparameters using RandomizedSearch CV for Bagging Classifier were found to be: n_estimator 70, max_samples 0.9, max_features 0.8 The average 5 fold cross validation training performance score using the best parameter Bagging classifier is 0.89. This is similar to the performance score on the validation set The model does however have a tendency to overfit the training set as can be observed from training performance.

Model performance comparison and choosing the final model¶

In [204]:
# training performance comparison

models_train_comp_df = pd.concat(
    [xgboost_train_perf.T, rf_random_train.T, bagging_random_train.T, perf_dcsn_tree.T], axis=1,
)
models_train_comp_df.columns = [
    "XGBoost Tuned -Undersampled",
    "Random forest Tuned-undersampled",
    "Bagging Tuned -undersampled",
    "Decision Tree"
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[204]:
XGBoost Tuned -Undersampled Random forest Tuned-undersampled Bagging Tuned -undersampled Decision Tree
Accuracy 0.965 0.993 1.000 1.000
Recall 0.931 0.986 1.000 1.000
Precision 0.999 1.000 1.000 1.000
F1 0.964 0.993 1.000 1.000
In [205]:
models_val_comp_df = pd.concat(
    [xgboost_grid_val.T, rf_random_val.T, bagging_random_val.T, perf_dcsn_tree_val.T], axis=1,
)
models_val_comp_df.columns = [
    "XGBoost Tuned-Undersampled",
    "Random forest Tuned-undersampled",
    "Bagging Tuned-undersampled",
    "Decision Tree"
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
Out[205]:
XGBoost Tuned-Undersampled Random forest Tuned-undersampled Bagging Tuned-undersampled Decision Tree
Accuracy 0.990 0.947 0.941 0.969
Recall 0.847 0.905 0.887 0.721
Precision 0.974 0.513 0.483 0.721
F1 0.906 0.655 0.625 0.721

Test set final performance¶

In [206]:
rf_tuned = RandomForestClassifier(
    n_estimators=250,
    min_samples_leaf=1,
    max_samples=0.5000000000000001,
    max_features="sqrt",
    random_state=1,
)

# Fit the model on test data
rf_tuned.fit(X_test, y_test)
Out[206]:
RandomForestClassifier(max_features='sqrt', max_samples=0.5000000000000001,
                       n_estimators=250, random_state=1)
In [207]:
# Calculating different metrics on test set
rf_random_test = model_performance_classification_sklearn(rf_tuned, X_test, y_test)
print("Test performance:")
rf_random_test
Test performance:
Out[207]:
Accuracy Recall Precision F1
0 0.991 0.833 0.995 0.907

The RandomForest tuned model is generalizing well on the test data with a recall of 0.83 .

In [214]:
feature_names = df_test.columns
importances = rf_tuned.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Pipelines to build the final model¶

In [209]:
X_train_pipeline = df_train.drop("Target", axis=1)
y_train_pipeline = df_train["Target"]
In [210]:
X_test_pipeline = df_test.drop("Target", axis=1)
y_test_pipeline = df_test["Target"]
In [211]:
model_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        (
            "RandomForest",
            RandomForestClassifier(
                n_estimators=250,
                min_samples_leaf=1,
                max_samples=0.5000000000000001,
                max_features="sqrt",
                random_state=1,
            ),
        ),
    ]
)
# Fit the model on training data
model_pipeline.fit(X_train_pipeline, y_train_pipeline)
Out[211]:
Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                ('RandomForest',
                 RandomForestClassifier(max_features='sqrt',
                                        max_samples=0.5000000000000001,
                                        n_estimators=250, random_state=1))])
In [212]:
model_pipeline.predict(X_test_pipeline)
Out[212]:
array([0, 0, 0, ..., 0, 0, 0])
In [213]:
# Let's check the performance on test set
Model_test = model_performance_classification_sklearn(model_pipeline, X_test, y_test)
Model_test
Out[213]:
Accuracy Recall Precision F1
0 0.992 0.865 0.990 0.923

Business Insights and Conclusions¶

From our objective its clear that we need to calculate the best recall value.

We are aimed to build a machine learning model that will help identify failures so that the generators could be repaired before failing/breaking to reduce the overall maintenance cost

We chose random forest as our final model after comparing the results of oversample and undersample data and hyperparameter tuning parameters.

A pipeline was additionally built for the final chosen model.

The main attributes of importance for predicting failures vs. no failures were found to be "V18", "V21", "V35", "V12" & "V15" in order of decreasing importance. This helps in collecting more frequent sensor information to be used in improving the machine learning model to further decrease maintenance costs.